RoadRunner: Towards Automatic Data Extraction from Large Web Sites

نویسندگان

  • Valter Crescenzi
  • Giansalvatore Mecca
  • Paolo Merialdo
چکیده

The paper investigates techniques for extracting data from HTML sites through the use of automatically generated wrappers. To automate the wrapper generation and the data extraction process, the paper develops a novel technique to compare HTML pages and generate a wrapper based on their similarities and differences. Experimental results on real-life data-intensive Web sites confirm the feasibility of the approach.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Automatic Web Information Extraction in the ROADRUNNER System

This paper presents roadRunner, a research project that aims at developing solutions for automatically extracting data from large HTML data sources. The target of our research are data-intensive Web sites, i.e., HTML-based sites with a fairly complex structure, that publish large amounts of data. The paper describes the top-level software architecture of the roadRunner System, and the novel res...

متن کامل

The ROADRUNNER Project: Towards Automatic Extraction of Web Data

ROADRUNNER is a research project that aims at developing solutions for automatically extracting data from large HTML data sources. The target of our research are data-intensive Web sites, i.e., HTML-based sites that publish large amounts of data in a fairly complex structure. In our view, we aim at ideally seeing the data extraction process of a data-intensive Web site as a black-box taking as ...

متن کامل

The RoadRunner Web Data Extraction System

Extracting data from HTML text files and making them available to computer applications is becoming of utmost importance for developing several emerging e-services. This paper presents RoadRunner, a research project that aims at developing solutions for automatically extracting data from large HTML data sources. We concentrate on data-intensive Web sites, that is, sites that deliver large amoun...

متن کامل

Automatic annotation of data extracted from large Web sites

Data extraction from web pages is performed by software modules called wrappers. Recently, some systems for the automatic generation of wrappers have been proposed in the literature. These systems are based on unsupervised inference techniques: taking as input a small set of sample pages, they can produce a common wrapper to extract relevant data. However, due to the automatic nature of the app...

متن کامل

RoadRunner for Heterogeneous Web Pages Using Extended MinHash

The Internet presents large amount of useful information which is usually formatted for its users, which makes it hard to extract relevant data from diverse sources. Therefore, there is a significant need of robust, flexible Information Extraction (IE) systems that transform the web pages into program friendly structures such as a relational database will become essential. IE produces structure...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2001